Unsupervised learning is a class of machine learning algorithms that identifies patterns or grouping structure in data. Unlike supervised learning, which relies on "supervised" information such as a dependent variable to guide modeling, unsupervised learning explores the structure and possible groupings of unlabeled data. The results can also serve as a pre-processing step for supervised learning.
Unsupervised learning has no explicit dependent variable \(Y\) to predict. Instead, the goal is to discover interesting patterns in the measurements \(X_{1}, X_{2}, \ldots, X_{p}\) and to identify any subgroups among the observations.
This section introduces two general methods: principal components analysis and clustering.
Principal Components Analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
The first principal component of a set of features \(X_1, X_2, . . . , X_p\) is the normalized linear combination of the features:
\[ Z_1 = \phi_{11}X_1 +\phi_{21}X_2 +...+\phi_{p1}X_p \]
that has the largest variance. By normalized, we mean that \(\sum_{j=1}^p\phi_{j1}^2 = 1\).
The elements \(\phi_{11}, . . . , \phi_{p1}\) are the loadings of the first principal component; together, the loadings make up the principal component loading vector, \(\phi_1 = (\phi_{11} \phi_{21} ... \phi_{p1})^T\) .
We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
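As a quick sanity check (a sketch, not part of the original analysis), the first loading vector can be recovered by hand: with scaled features, it is the leading eigenvector of the correlation matrix, its squared loadings sum to one, and the variance of the resulting scores equals the largest eigenvalue. The built-in `USArrests` data is used here.

```r
X <- scale(USArrests)             # center and scale the features
eig <- eigen(cor(USArrests))      # eigendecomposition of the correlation matrix
phi1 <- eig$vectors[, 1]          # first principal component loading vector
sum(phi1^2)                       # normalization constraint: equals 1
Z1 <- X %*% phi1                  # scores on the first principal component
drop(var(Z1))                     # equals the largest eigenvalue
# Matches prcomp up to an arbitrary sign flip
pr <- prcomp(USArrests, scale = TRUE)
all.equal(abs(phi1), abs(unname(pr$rotation[, 1])))
```

This also makes concrete why the sign of a loading vector is arbitrary: \(-\phi_1\) yields the same variance.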
K-means clustering partitions the data points into \(K\) groups such that the sum of squared distances from the points to their assigned cluster centers is minimized.
Hierarchical clustering is an alternative approach that does not require a pre-specified number of clusters \(K\). It has the advantage of producing a tree-based representation of the observations, called a dendrogram.
A dendrogram is built starting from the leaves and combining clusters up toward the trunk. Observations can then be divided into groups by cutting the dendrogram at a desired similarity level.
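Cutting the dendrogram can be sketched with base R's `cutree()`, which accepts either a number of groups `k` or a cut height `h` (complete linkage on the built-in `USArrests` data is assumed here for illustration):

```r
# Hierarchical clustering on scaled USArrests, complete linkage
hc <- hclust(dist(scale(USArrests)), method = "complete")
groups <- cutree(hc, k = 4)   # cut the tree into four clusters
table(groups)                 # cluster sizes
# Alternatively, cut at a chosen dissimilarity level (height)
groups_h <- cutree(hc, h = 5)
```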
1. Principal Component Analysis (PCA)
## Gentle Machine Learning
## Principal Component Analysis
# Dataset: USArrests is the sample dataset used in
# McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
# Murder numeric Murder arrests (per 100,000)
# Assault numeric Assault arrests (per 100,000)
# UrbanPop numeric Percent urban population
# Rape numeric Rape arrests (per 100,000)
# For each of the fifty states in the United States, the dataset contains the number
# of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape.
# UrbanPop is the percent of the population in each state living in urban areas.
library(datasets)
library(ISLR)
arrest = USArrests
states=row.names(USArrests)
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
# Get means and variances of variables
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
# PCA with scaling
pr.out=prcomp(USArrests, scale=TRUE)
names(pr.out) # Five
## [1] "sdev" "rotation" "center" "scale" "x"
pr.out$center # the centering and scaling used (means)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
pr.out$scale # the standard deviations used for scaling
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
pr.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
dim(pr.out$x)
## [1] 50 4
# Flip the signs of loadings and scores for easier interpretation
# (the sign of each principal component is arbitrary)
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)
pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
pr.var=pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
pve=pr.var/sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
## Use factoextra package
library(factoextra)
fviz(pr.out, "ind", geom = "auto", mean.point = TRUE, font.family = "Georgia")
fviz_pca_biplot(pr.out, font.family = "Georgia", col.var="firebrick1")
2. K-Means Clustering
## Computer purchase example: Animated illustration
## Adapted from Guru99 tutorial (https://www.guru99.com/r-k-means-clustering.html)
## Dataset: characteristics of computers purchased.
## Variables used: RAM size, Harddrive size
library(dplyr)
library(ggplot2)
library(RColorBrewer)
computers = read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv")
# Only retain two variables for illustration
rescaled_comp <- computers[4:5] %>%
  mutate(hd_scal = scale(hd),
         ram_scal = scale(ram)) %>%
  select(c(hd_scal, ram_scal))
ggplot(data = rescaled_comp, aes(x = hd_scal, y = ram_scal)) +
geom_point(pch=20, col = "blue") + theme_bw() +
labs(x = "Hard drive size (Scaled)", y ="RAM size (Scaled)" ) +
theme(text = element_text(family="Georgia"))
# install.packages("animation")
library(animation)
set.seed(2345)
# Animate the K-mean clustering process, cluster no. = 4
kmeans.ani(rescaled_comp[1:2], centers = 4, pch = 15:18, col = 1:4)
saveGIF(
  kmeans.ani(rescaled_comp[1:2], centers = 4, pch = 15:18, col = 1:4),
  movie.name = "kmeans_animated.gif",
  img.name = "kmeans",
  convert = "magick",
  clean = TRUE,
  extra.opts = ""
)
## [1] TRUE
(Animated K-means output saved as kmeans_animated.gif)
## Iris example
# Without grouping by species
ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point() +
  theme_bw()
# With grouping by species
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point() +
theme_bw() +
scale_color_manual(values=c("firebrick1","forestgreen","darkblue"))
# Check k-means clusters
## Starting with three clusters and 20 initial configurations
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
## K-means clustering with 3 clusters of sizes 52, 48, 50
##
## Cluster means:
## Petal.Length Petal.Width
## 1 4.269231 1.342308
## 2 5.595833 2.037500
## 3 1.462000 0.246000
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
##
## Within cluster sum of squares by cluster:
## [1] 13.05769 16.29167 2.02200
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
class(irisCluster$cluster)
## [1] "integer"
# Confusion matrix
table(irisCluster$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 48 4
## 2 0 2 46
## 3 50 0 0
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point() +
scale_color_manual(values=c("firebrick1","forestgreen","darkblue")) +
theme_bw()
actual = ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point() +
theme_bw() +
scale_color_manual(values=c("firebrick1","forestgreen","darkblue")) +
theme(legend.position="bottom") +
theme(text = element_text(family="Georgia"))
kmc = ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point() +
theme_bw() +
scale_color_manual(values=c("firebrick1", "darkblue", "forestgreen")) +
theme(legend.position="bottom") +
theme(text = element_text(family="Georgia"))
library(grid)
library(gridExtra)
grid.arrange(arrangeGrob(actual, kmc, ncol=2, widths=c(1,1)), nrow=1)
library(readr)
wine <- read_csv("https://raw.githubusercontent.com/datageneration/gentlemachinelearning/master/data/wine.csv")
wine_subset <- scale(wine[ , c(2:4)])
wine_cluster <- kmeans(wine_subset, centers = 3, iter.max = 10, nstart = 25)
wine_cluster
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(wine_subset, nc = 9)
wine_cluster$cluster = as.factor(wine_cluster$cluster)
pairs(wine[2:4],
      col = c("firebrick1", "darkblue", "forestgreen")[wine_cluster$cluster],
      pch = c(15:17)[wine_cluster$cluster],
      main = "K-Means Clusters: Wine data")
table(wine_cluster$cluster)
library(factoextra)
fviz_nbclust(wine_subset, kmeans, method = "wss")
wine.km <- eclust(wine_subset, "kmeans", nboot = 2)
wine.km
wine.km$nbclust
fviz_nbclust(wine_subset, kmeans, method = "gap_stat")
fviz_silhouette(wine.km)
fviz_cluster(wine_cluster, data = wine_subset) + theme_bw() +
  theme(text = element_text(family="Georgia"))
fviz_cluster(wine_cluster, data = wine_subset, ellipse.type = "norm") + theme_bw() +
  theme(text = element_text(family="Georgia"))
3. Hierarchical Clustering
## Hierarchical Clustering
## Dataset: USArrests
# install.packages("cluster")
arrest.hc <- USArrests %>%
scale() %>% # Scale all variables
dist(method = "euclidean") %>% # Euclidean distance for dissimilarity
hclust(method = "ward.D2") # Compute hierarchical clustering
# Generate dendrogram using factoextra package
fviz_dend(arrest.hc, k = 4, # Four groups
cex = 0.5,
k_colors = c("firebrick1","forestgreen","blue", "purple"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle (cluster) around groups,
main = "Cluster Dendrogram: USA Arrest data"
) + theme(text = element_text(family="Georgia"))
References
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning*. Vol. 112. New York: Springer.